In [1]:
DATA_FOLDER = 'Data' # Use the data folder provided in Tutorial 02 - Intro to Pandas.
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from dateutil.parser import parse
from os import listdir
from os.path import isfile, join
sns.set_context('notebook')
The DATA_FOLDER/ebola folder contains summarized reports of Ebola cases from three countries (Guinea, Liberia and Sierra Leone) during the recent outbreak of the disease in West Africa. For each country, there are daily reports that contain various information about the outbreak in several cities in each country.

Use pandas to import these data files into a single DataFrame.

Using this DataFrame, calculate for each country the daily average per month of new cases and deaths.

Make sure you handle all the different expressions for new cases and deaths that are used in the reports.
First, we define some helper functions that we will use when parsing the data.
In [2]:
def get_files(country):
    path = DATA_FOLDER + "/ebola/" + country + "_data/"
    return [f for f in listdir(path) if isfile(join(path, f))]
In [3]:
def sum_row(row, total_col):
    # Extract the single value in the totals column of a one-row selection.
    return float(row[total_col].values[0])

def sum_rows(rows, total_col):
    # Add up the totals column over several one-row selections.
    tot = 0
    for row in rows:
        tot += sum_row(row, total_col)
    return tot
Now, for each country we define a function which, for a given file, returns a dictionary with the country, the date, upper and lower bounds for the new cases, and upper and lower bounds for the new deaths. As we don't know whether the new cases / deaths for the 'probable' and 'suspected' cases are reliable, we decided to create an upper bound as the sum of the 'confirmed', 'probable' and 'suspected' new cases / deaths, and a lower bound with only the 'confirmed' new cases / deaths.
The structure of these functions is the same for each country; only the labels used to describe the data change.
In [4]:
def get_row_guinea(file):
    country = 'guinea'
    date = file[:10]  # the report date is the first 10 characters of the file name
    raw = pd.read_csv(DATA_FOLDER + "/ebola/" + country + "_data/" + file)
    total_col = "Totals"
    new_cases_lower = sum_row(raw[raw.Description == "New cases of confirmed"], total_col)
    new_cases_upper = sum_row(raw[raw.Description == "Total new cases registered so far"], total_col)
    new_deaths_lower = sum_row(raw[(raw.Description == "New deaths registered today (confirmed)") | (raw.Description == "New deaths registered")], total_col)
    new_deaths_upper = sum_row(raw[(raw.Description == "New deaths registered today") | (raw.Description == "New deaths registered")], total_col)
    return {'Country' : country, 'Date' : parse(date), 'NewCasesLower' : new_cases_lower, 'NewCasesUpper' : new_cases_upper, 'NewDeathsLower' : new_deaths_lower, 'NewDeathsUpper' : new_deaths_upper}
In [5]:
def get_row_liberia(file):
    country = 'liberia'
    date = file[:10]
    raw = pd.read_csv(DATA_FOLDER + "/ebola/" + country + "_data/" + file).fillna(0)
    total_col = "National"
    new_cases_lower = sum_row(raw[raw.Variable == "New case/s (confirmed)"], total_col)
    list_cases_upper = ["New Case/s (Suspected)",
                        "New Case/s (Probable)",
                        "New case/s (confirmed)"]
    new_cases_upper = sum_rows([raw[raw.Variable == row] for row in list_cases_upper], total_col)
    new_deaths_lower = sum_row(raw[raw.Variable == "Newly reported deaths"], total_col)
    new_deaths_upper = new_deaths_lower  # no confirmed/suspected breakdown of deaths, so both bounds coincide
    return {'Country' : country, 'Date' : parse(date), 'NewCasesLower' : new_cases_lower, 'NewCasesUpper' : new_cases_upper, 'NewDeathsLower' : new_deaths_lower, 'NewDeathsUpper' : new_deaths_upper}
As the files for Sierra Leone do not contain data for the new deaths, we first extract the total deaths for each day, and process them later to derive the new deaths.
In [6]:
def get_row_sl(file):
    country = 'sl'
    date = file[:10]
    raw = pd.read_csv(DATA_FOLDER + "/ebola/" + country + "_data/" + file).fillna(0)
    total_col = "National"
    new_cases_lower = sum_row(raw[raw.variable == "new_confirmed"], total_col)
    list_cases_upper = ["new_suspected",
                        "new_probable",
                        "new_confirmed"]
    new_cases_upper = sum_rows([raw[raw.variable == row] for row in list_cases_upper], total_col)
    list_death_upper = ["death_suspected",
                        "death_probable",
                        "death_confirmed"]
    total_death_upper = sum_rows([raw[raw.variable == row] for row in list_death_upper], total_col)
    total_death_lower = sum_row(raw[raw.variable == "death_confirmed"], total_col)
    return {'Country' : country, 'Date' : parse(date), 'NewCasesLower' : new_cases_lower, 'NewCasesUpper' : new_cases_upper, 'TotalDeathLower' : total_death_lower, 'TotalDeathUpper' : total_death_upper}
In [7]:
rows_guinea = [get_row_guinea(file) for file in get_files("guinea")]
In [8]:
rows_liberia = [get_row_liberia(file) for file in get_files("liberia")]
We now transform the data for Sierra Leone:
In [9]:
rows_sl_total_deaths = [get_row_sl(file) for file in get_files("sl")]
dic_sl_total_deaths = {}
for row in rows_sl_total_deaths:
    dic_sl_total_deaths[row['Date']] = row
rows_sl = []
for date, entry in dic_sl_total_deaths.items():
    date_before = date - datetime.timedelta(days=1)
    if date_before in dic_sl_total_deaths:
        # Only difference days whose totals are both reported (non-zero).
        if entry['TotalDeathUpper'] != 0 and dic_sl_total_deaths[date_before]['TotalDeathUpper'] != 0 and entry['TotalDeathLower'] != 0 and dic_sl_total_deaths[date_before]['TotalDeathLower'] != 0:
            copy = dict(entry)
            del copy['TotalDeathUpper']
            del copy['TotalDeathLower']
            copy['NewDeathsUpper'] = entry['TotalDeathUpper'] - dic_sl_total_deaths[date_before]['TotalDeathUpper']
            copy['NewDeathsLower'] = entry['TotalDeathLower'] - dic_sl_total_deaths[date_before]['TotalDeathLower']
            rows_sl.append(copy)
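For reference, the same day-to-day differencing can be written more compactly with pandas. This is a minimal sketch that reuses the rows_sl_total_deaths list built above and, for brevity, omits the non-zero-total guard used in the loop:

sl = pd.DataFrame(rows_sl_total_deaths).set_index('Date').sort_index()
# diff() subtracts the previous row, so keep only differences between consecutive calendar days.
consecutive = sl.index.to_series().diff() == datetime.timedelta(days=1)
new_deaths = sl[['TotalDeathLower', 'TotalDeathUpper']].diff()[consecutive]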
We can now insert the data into a dataframe. For Liberia, December's data is in a completely different format, so we dropped it: for instance, on some days the reported new cases are the new cases for that day, while on others they are the total cases for the country.
In [10]:
raw_dataframe = pd.DataFrame(columns=['Country', 'Date', 'NewCasesLower', 'NewCasesUpper', 'NewDeathsLower', 'NewDeathsUpper'])
for rows in rows_sl, rows_guinea:
    # append() accepts a list of dicts, adding one row per dict
    raw_dataframe = raw_dataframe.append(rows, ignore_index = True)
for row in rows_liberia:
    if row['Date'].month != 12:  # December data is erroneous
        raw_dataframe = raw_dataframe.append(row, ignore_index = True)
raw_dataframe
Out[10]:
In [11]:
dataframe = raw_dataframe.set_index(['Country', 'Date'])
In [12]:
dataframe_no_day = raw_dataframe  # note: this aliases raw_dataframe rather than copying it
dataframe_no_day['Year'] = raw_dataframe['Date'].apply(lambda x: x.year)
dataframe_no_day['Month'] = raw_dataframe['Date'].apply(lambda x: x.month)
final_df = dataframe_no_day[['Country', 'Year', 'Month', 'NewCasesLower', 'NewCasesUpper', 'NewDeathsLower', 'NewDeathsUpper']].groupby(['Country', 'Year', 'Month']).mean()
final_df
Out[12]:
Finally, to get a general picture of the data, we average the lower and upper bounds.
In [13]:
s1 = final_df[['NewCasesLower', 'NewCasesUpper']].mean(axis=1)
s2 = final_df[['NewDeathsLower', 'NewDeathsUpper']].mean(axis=1)
final = pd.concat([s1, s2], axis=1)
final.columns = ['NewCasesAverage', 'NewDeathsAverage']
final
Out[13]:
In the DATA_FOLDER/microbiome subdirectory, there are 9 spreadsheets of microbiome data that was acquired from high-throughput RNA sequencing procedures, along with a 10th file that describes the content of each.

Use pandas to import the first 9 spreadsheets into a single DataFrame.

Then, add the metadata information from the 10th spreadsheet as columns in the combined DataFrame.

Make sure that the final DataFrame has a unique index and all the NaN values have been replaced by the tag unknown.
We load the first spreadsheet from the file's Sheet 1. Then we add a new column, identical for every row of this import, holding the barcode of the sample (taken from the file name). Then we rename the columns for clarity.
In [14]:
mid = pd.read_excel(DATA_FOLDER + '/microbiome/MID1.xls', sheetname='Sheet 1', header=None)
mid.fillna('unknown', inplace=True)
mid['BARCODE'] = 'MID1'
mid.columns = ['Taxon', 'Count', 'BARCODE']
Now we repeat this operation for every other spreadsheet except the metadata. At each iteration we simply concatenate the new data onto the previous data, thus accumulating all the files' data into a single dataframe. We don't care about the index right now, since we will set a proper one later.
In [15]:
for i in range(2, 10):
    midi = pd.read_excel(DATA_FOLDER + '/microbiome/MID' + str(i) + '.xls', sheetname='Sheet 1', header=None)
    midi.fillna('unknown', inplace=True)
    midi['BARCODE'] = 'MID' + str(i)
    midi.columns = ['Taxon', 'Count', 'BARCODE']
    mid = pd.concat([mid, midi])
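As an aside, concatenating inside the loop copies the accumulated frame at every step. An equivalent and more idiomatic variant collects the pieces in a list first and concatenates once:

frames = []
for i in range(1, 10):
    midi = pd.read_excel(DATA_FOLDER + '/microbiome/MID' + str(i) + '.xls', sheetname='Sheet 1', header=None)
    midi.fillna('unknown', inplace=True)
    midi['BARCODE'] = 'MID' + str(i)
    midi.columns = ['Taxon', 'Count', 'BARCODE']
    frames.append(midi)
mid = pd.concat(frames)  # single concatenation of all nine spreadsheets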
Finally, we merge with the metadata, joining on the BARCODE column, which is the index of the metadata as we import it (index_col=0). We then set the index to the columns BARCODE and Taxon, whose combination is unique.

The only NaN value we found was the NA value in the metadata, which may indicate that there is no sample for the first group. We replaced it with unknown anyway.
In [16]:
metadata = pd.read_excel(DATA_FOLDER + '/microbiome/metadata.xls', sheetname='Sheet1', index_col=0)
metadata.fillna('unknown', inplace=True)
merged = pd.merge(mid, metadata, right_index=True, left_on='BARCODE')
merged = merged.set_index(keys=['BARCODE', 'Taxon'])
merged
Out[16]:
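As a sanity check against the requirements, we can verify that the final index is unique and that no NaN values remain:

print(merged.index.is_unique)        # should be True: each (BARCODE, Taxon) pair appears once
print(merged.isnull().values.any())  # should be False: every NaN was replaced by 'unknown'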
In [17]:
from IPython.core.display import HTML
HTML(filename=DATA_FOLDER+'/titanic.html')
Out[17]:
For each of the following questions state clearly your assumptions and discuss your findings.

1. We start by importing the data from the file.
In [18]:
titanic = pd.read_excel(DATA_FOLDER + '/titanic.xls', sheetname='titanic')
titanic
Out[18]:
Next we can list the data types of each field.
In [19]:
titanic.dtypes
Out[19]:
When it comes to the object fields, we can be a bit more precise: name, sex, ticket, cabin, embarked, boat and home.dest are all strings.
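A quick way to list these columns is to select everything pandas stored with the generic object dtype:

titanic.select_dtypes(include=['object']).columns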
Next, we call the describe method to list some statistics on the data. We thus obtain the range of each of the numeric fields.
In [20]:
titanic.describe()
Out[20]:
Moreover, we can also note the ranges of some other fields. For example, sex has only two possible values, female and male, and embarked can only be S, C or Q.
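These ranges can be verified directly, for instance:

titanic.sex.unique(), titanic.embarked.dropna().unique()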
For a better visual result, we decided to replace the travel class, survival and port codes with more readable values. Since we convert them to categorical types, performance stays the same.
In [21]:
class_dic = {1 : 'First Class', 2 : 'Second Class', 3 : 'Third Class', np.nan : np.nan}
survived_dic = {0 : 'Deceased' , 1 : 'Survived', np.nan : np.nan}
embarked_dic = {'C' : 'Cherbourg', 'Q' : 'Queenstown', 'S' : 'Southampton', np.nan : np.nan}
titanic['pclass'] = titanic['pclass'].apply(lambda x: class_dic[x])
titanic['survived'] = titanic['survived'].apply(lambda x: survived_dic[x])
titanic['embarked'] = titanic['embarked'].apply(lambda x: embarked_dic[x])
Then we convert the categorical columns to actual categorical types.
In [22]:
titanic['pclass'] = titanic.pclass.astype('category')
titanic['survived'] = titanic.survived.astype('category')
titanic['sex'] = titanic.sex.astype('category')
titanic['embarked'] = titanic.embarked.astype('category')
titanic['cabin'] = titanic.cabin.astype('category')
titanic['boat'] = titanic.boat.astype('category')
2. We plot the histogram of the travel class.
In [23]:
titanic.pclass.value_counts(sort=False).plot(kind='bar')
Out[23]:
Next we plot the histogram of the three embarkation ports.
In [24]:
titanic.embarked.value_counts().plot(kind='bar')
Out[24]:
Next we plot the histogram of the sexes.
In [25]:
titanic.sex.value_counts().plot(kind='bar')
Out[25]:
Next, we cut the age data into decades and plot the histogram of the decades.
In [26]:
pd.cut(titanic.age, range(0,90,10)).value_counts(sort=False).plot(kind='bar')
Out[26]:
3. We plot the cabin floor data as a pie chart.
In [27]:
titanic.cabin.dropna().apply(lambda x : x[0]).value_counts(sort=False).plot(kind='pie')
Out[27]:
4. Here, we plot the proportion of people that survived in the first class.
In [28]:
titanic[titanic.pclass == "First Class"].survived.value_counts(sort=False).plot(kind='pie')
Out[28]:
Next, we plot the proportion of people that survived in the second class.
In [29]:
titanic[titanic.pclass == "Second Class"].survived.value_counts(sort=False).plot(kind='pie')
Out[29]:
Finally, we plot the proportion of people that survived in the third class.
In [30]:
titanic[titanic.pclass == "Third Class"].survived.value_counts(sort=False).plot(kind='pie')
Out[30]:
As we can see, the lower the class, the higher the probability of death.
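A quick numerical check of this trend, computed from the readable labels set above:

# Fraction of survivors per travel class.
(titanic.survived == 'Survived').groupby(titanic.pclass).mean()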
5. Here we add new columns that will help us calculate the proportion of survivors in the last part.
In [31]:
titanic.insert(0, 'alive', 0)
titanic.insert(0, 'dead', 0)
titanic.insert(0, 'ratio', 0)
Here we set these new columns to appropriate values. We essentially split the survived column into two indicator columns for easier summing later on. Finally, we slice the data to keep only the columns of interest.
In [32]:
titanic.loc[titanic['survived'] == "Survived", 'alive'] = 1
titanic.loc[titanic['survived'] == "Deceased", 'dead'] = 1
df = titanic[['pclass', 'sex', 'alive', 'dead', 'ratio']]
We group the data by the sex and class of the passengers and sum it. We then have the counts of surviving and deceased people grouped as we wish, and we can easily calculate the proportion that survived, which we plot as a histogram.
In [33]:
aggregated = df.groupby(['sex', 'pclass']).sum()
(aggregated['alive'] / (aggregated['alive'] + aggregated['dead'])).plot(kind='bar')
Out[33]:
We can see that there is a huge difference in survival between the classes and sexes: for instance, third-class males had roughly seven times less chance of survival than first-class females.
6. Next we insert a new column holding the age category of each person. Since we want to split the people into two equal groups based on age, we compute the median age of the passengers. We also drop the passengers with an unknown age, to avoid distorting the median computation.
In [34]:
titanic.dropna(axis=0, subset=['age'], inplace=True)
titanic.insert(0, 'age_category', 0)
median = titanic['age'].median()
Next, we assign the correct category to people below or above the median age; people at exactly the median are grouped with those below it. We then make this column categorical.
In [35]:
titanic.loc[titanic['age'] > median, 'age_category'] = "Age > " + str(median)
titanic.loc[titanic['age'] <= median, 'age_category'] = "Age <= " + str(median)
titanic['age_category'] = titanic.age_category.astype('category')
Next, we take the columns of interest and group by age category, sex and travel class. We then sum over these groups, obtaining the counts of those who lived and those who died, from which we compute the proportion of survivors and display it as a dataframe.
In [36]:
sub = titanic[['pclass', 'sex', 'age_category', 'alive', 'dead', 'ratio']]
subagg = sub.groupby(['age_category', 'sex', 'pclass']).sum()
subagg['ratio'] = (subagg['alive'] / (subagg['alive'] + subagg['dead']))
only_ratio = subagg[['ratio']]
only_ratio
Out[36]:
As before, we can see that there is a huge difference in survival between the classes and sexes. On the other hand, age doesn't make a large difference: whether a passenger is above or below 28 years old, their probability of survival is determined far more by sex and class.
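For an at-a-glance comparison, the same ratios can also be plotted, for instance:

# One bar group per age category, split by (sex, class) combination.
only_ratio['ratio'].unstack(['sex', 'pclass']).plot(kind='bar')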